
Conducting sparse feature selection on arbitrarily long phrases in text corpora with a focus on interpretability



Abstract

We propose a general framework for topic-specific summarization of large text corpora, and illustrate how it can be used for analysis in two quite different contexts: an OSHA database of fatality and catastrophe reports (to facilitate surveillance for patterns in circumstances leading to injury or death) and legal decisions on workers' compensation claims (to explore relevant case law). Our summarization framework, built on sparse classification methods, is a compromise between the simple word-frequency-based methods currently in wide use and more heavyweight, model-intensive methods such as Latent Dirichlet Allocation (LDA). For a particular topic of interest (e.g., mental health disability, or chemical reactions), we regress a labeling of documents onto the high-dimensional counts of all the other words and phrases in the documents. The resulting small set of phrases found to be predictive is then harvested as the summary. Using a branch-and-bound approach, this method can be extended to allow for phrases of arbitrary length, which allows for potentially rich summarization. We discuss how focus on the purpose of the summaries can inform choices of regularization parameters and model constraints. We evaluate this tool by comparing computational time and summary statistics of the resulting word lists to three other methods in the literature. We also present a new R package, textreg. Overall, we argue that sparse methods have much to offer text analysis, and that this is a branch of research that should be considered further in this context.
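The core idea in the abstract is to regress a topic labeling of documents onto high-dimensional phrase counts with a sparsity-inducing penalty, then keep the few phrases with nonzero coefficients as the summary. Below is a minimal sketch of that idea using scikit-learn, offered only as a generic analogue: it is not the authors' textreg package, and the example documents, labels, ngram_range, and regularization strength C are illustrative assumptions.

# Minimal sketch: sparse phrase selection for topic-specific summarization.
# Generic analogue of the approach described in the abstract, not the
# authors' textreg implementation; data and parameters are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

documents = [
    "worker fell from scaffold and suffered fatal head injury",
    "employee exposed to chemical reaction that released toxic fumes",
    "routine inspection of warehouse shelving completed",
    "claim for mental health disability following workplace harassment",
]
# 1 = document labeled as belonging to the topic of interest, 0 = background
labels = [1, 1, 0, 1]

# Count all words and short phrases (n-grams up to length 3 here; the paper's
# branch-and-bound search instead allows phrases of arbitrary length).
vectorizer = CountVectorizer(ngram_range=(1, 3), binary=True)
X = vectorizer.fit_transform(documents)

# An L1 (lasso-type) penalty drives most phrase coefficients to exactly zero,
# so only a small, interpretable set of phrases is selected.
model = LogisticRegression(penalty="l1", C=1.0, solver="liblinear")
model.fit(X, labels)

# Harvest the phrases with nonzero coefficients as the summary.
phrases = vectorizer.get_feature_names_out()
summary = [p for p, w in zip(phrases, model.coef_[0]) if w != 0]
print(summary)

Note that a fixed ngram_range caps phrase length in advance, which is exactly the limitation the paper's branch-and-bound extension removes; the regularization strength plays the role of the tuning choices the abstract says should be driven by the purpose of the summary.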
